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1.  Introduction 

The  goal  of  this  work  is  to  exploit  image  analysis  techniques  to  solve  image  segmentation  and 
image  retrieval  problems.  Here  we  propose  two  segmentation  techniques;  the  first  method 
involves  an  automated  threshold  selection  method,  while  the  second  is  a  region  based 
segmentation  method  where  the  intensity  inhomogeneity  is  modeled  as  a  linear  combination  of 
the  basis  learned  from  the  images  present  in  the  dataset.  The  image  classification  method 
implemented  here  is  based  on  the  multiple-kernel  based  dictionary  learning  technique. 

A  wide  variety  of  threshold  selection  techniques  for  image  segmentation  exist  in  the 
literature.  Conventional  methods  pick  a  threshold  that  minimizes  the  overlap  or  maximizes  the 
variance  in  between  two  adjacent  intensity  histogram  clusters,  based  on  certain  statistical  metric. 
We  propose  an  alternative  threshold  selection  in  the  context  of  a  GMM  where  the  threshold 
selection  becomes  a  two-step  process:  first,  estimate  the  Gaussian  distribution  parameters  (mean, 
standard  deviation,  and  weight)  by  fitting  the  mixture  of  Gaussian  that  best  approximates  the 
normalized  histogram;  and  then,  calculate  thresholds  based  on  the  estimated  Gaussian 
parameters.  Here,  we  employ  a  learning-based  optimization  approach  using  learning  automata 
for  estimating  Gaussian  mixture  parameters.  The  randomness  in  the  generation  of  parameter 
estimates  provides  learning  automaton  algorithms  with  obvious  advantages  over  the  gradient- 
based  and  expectation  maximization  (EM)  algorithms.  We  approach  the  Gaussian  parameter 
estimation  problem  with  the  so-called  continuous  action  reinforcement  learning  automata 
(CARLA)  [1]  algorithm.  In  particular,  the  output  of  each  automaton  corresponds  to  one  specific 
Gaussian  parameter  to  be  estimated.  Their  combination  at  each  stage  provides  a  fitted  Gaussian 
mixture  model  of  the  original  image  histogram.  The  match  between  the  two  is  evaluated  based 
on  some  metric,  e.g.,  the  average  mean  square  error,  and  is  returned  as  a  reinforcement  signal,  to 
be  applied  as  the  input  to  each  automaton.  Inside  a  learning  automaton,  a  probability  distribution 
function  (PDF)  is  defined  over  the  parameter  search  range,  representing  the  desirability  of  each 
specific  value,  and  is  updated  as  learning  proceeds  based  on  the  reinforcement  signal.  A 
parameter  estimate  an  automaton  output)  is  randomly  generated  for  the  next  stage  based  on  the 
current  probability  distribution. 

The  second  approach  for  segmentation  employs  a  novel  region  based  segmentation  technique 
using  dictionary  learning.  We  hypothesize  that  in  problems  where  a  set  of  training  images  for  the 
object  is  available  for  analysis,  each  image  can  be  compactly  represented  as  a  linear  combination 
of  similar  images  or  a  basis  learned  from  the  training  images  and  segmentation  accuracy  can  be 
significantly  improved  by  computing  the  basis  functions  instead  of  specifying  them  implicitly. 
The  salient  idea  of  this  approach  is  to  compute  the  optimal  set  of  functions  to  model  the  region 
intensities.  Our  solution  to  this  problem  involves  the  integration  of  a  level  set  segmentation 
methodology  with  the  dictionary  learning  framework.  This  provides  a  solution  to  deal  with 
intensity  inhomogeneity  prevalent  in  many  imaging  applications  such  as  ultrasound  and 
fluorescence  microscopy.  The  primary  objective  of  this  approach  is  to  develop  a  region  based 
segmentation  framework,  which  is  capable  of  handling  intensity  inhomogeneity.  It  may  be 
debated  that  adding  edge  information  to  the  region  based  framework  can  improve  segmentation. 
However,  extracting  accurate  edge  map  is  a  challenging  issue  by  itself  for  applications  where  the 
signal  is  poor  due  to  presence  of  speckle  and  clutter.  Therefore,  it  is  of  considerable  interest  to 
develop  a  solely  region  dependent  technique  that  can  accommodate  artifacts  such  as  noise, 
clutter  and  illumination  variation.  This  approach  employs  the  dictionary  learning  technique  to 
obtain  the  basis  function  (learn  from  the  training  data)  that  approximates  the  image  region 
intensities.  This  can  be  viewed  from  the  perspective  of  low  dimensional  approximation  of  a 
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signal.  While  Chan-Vese’s  [2]  method  is  a  form  of  extreme  dimensionality  reduction  (due  to 
piecewise  constant  assumption),  out  method  achieves  a  balance  between  reduction  of 
dimensionality  and  accurate  intensity  modeling. 

The  third  approach  deals  with  a  method  for  image  classification  problem.  Significant 
development  in  sparse  representation  based  classification  methods  [3]  [4]  [5]  [6]  [7]  has  come 
about  in  the  past  few  years.  In  [4],  it  was  shown  that  a  test  image  can  be  represented  as  a  linear 
combination  of  only  a  few  of  the  training  images  and  is  best  represented  when  the  test  image  is  a 
linear  combination  of  images  belonging  to  the  same  class.  However,  it  has  been  shown  in  [5]  [7] 
[6]  [3],  instead  of  using  a  pre-specified  basis  for  representation  of  the  test  sample,  learning  the 
bases  from  the  training  data  itself  has  proved  to  be  more  effective  in  data  representation.  But  the 
dictionary  learning  presented  in  these  papers  is  based  on  a  linear  model  in  feature  space.  The 
main  problem  with  this  approach  is  that  they  cannot  capture  the  non-linearity  present  in  the  data. 
Kernel  learning  methods  are  widely  used  to  deal  with  non-linearity  in  the  feature  space.  A  non¬ 
linear  transform  is  performed  on  the  feature  space  to  convert  the  data  to  a  higher  dimensional 
subspace  and  then  a  linear  classifier  model  can  be  applied.  In  a  classification  system  the  choice 
of  feature  extracted  from  the  image  is  of  crucial  importance.  However,  a  single  feature  cannot 
discriminatively  represent  all  the  images  in  the  dataset.  Here,  we  develop  an  information 
theoretic  kernel  combination  method  embedded  in  the  dictionary  learning  framework.  One 
advantage  of  the  kernel  space  representation,  other  than  a  higher-dimensional  representation,  lies 
in  the  fact  that  different  features  can  be  combined  in  the  kernel  space  using  multiple  kernel 
functions.  Our  method  learns  a  dictionary  in  the  kernel  space  for  each  of  the  classes  in  the 
learning  phase.  We  employ  a  mutual  information  based  approach  to  obtain  the  most  desirable 
weights  for  kernel  combination.  In  the  testing  phase,  our  method  exploits  the  learned  dictionaries 
and  the  kernel  weights  to  assign  a  class  label  to  the  test  image. 

In  this  report  we  provide  a  background  for  each  of  the  methods  implemented  in  solving  the 
above  mentioned  image  analysis  tasks.  Then  we  describe  in  details  the  methodologies  for  the 
threshold  selection,  region  based  segmentation  and  the  image  classification  problem.  We  provide 
relevant  experimental  results  for  each  of  these  approaches  and  discuss  the  future  directions  for 
these  applications. 

2.  Background 

2.1.  Level  set  Segmentation 

The  level  set  method,  used  as  a  numerical  technique  to  track  shapes  can  be  applied  to  image 
segmentation  techniques.  Image  segmentation  solely  based  on  the  gray  values  of  image  pixels 
was  first  proposed  by  Mumford  and  Shah  [8].  This  region  based  approach  was  made  popular 
later  by  Chan  and  Vese  [9]  who  used  level  sets  to  propagate  geometric  snakes  to  segment  the 
image  into  constant  intensity  regions.  The  Chan-Vese  framework  proposes  to  partition  the  image 
f(x)(x  G  fl  <=  M2)  into  sets  of  constant  intensity  regions.  The  optimal  partition  is  obtained  by 
trying  to  minimize  the  following  energy  functional: 

(  |f(x)-c1|2m1(x)dx  +  (  |f(x)  -  c2|2m2(x)  dx  (1) 

JQ.  JQ. 

Here  0  is  a  level  set  function  whose  zero  level  set  denotes  the  object  boundary.  0  is  constructed 
such  that  its  value  is  positive  inside  the  zero  level  contour  and  negative  outside.  The  local 
minimizer  0*  of  (1)  partitions  the  image  such  that  the  two  region  intensities  are  best 
approximated  by  the  constant  scalars  cx  and  c2,  which  are  updated  iteratively  using  alternating 
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minimization,  m x  (x)  =  He  (0)  is  the  regularized  version  of  the  standard  Heaviside  function,  the 
extent  of  regularization  being  controlled  by  the  parameter  e,  m2(x)  =  1  —  m1(x). 

Often  in  practical  scenarios  a  constant  intensity  model  fails  to  capture  the  intensity 
inhomogeneity.  To  adapt  the  model  in  (1)  to  capture  heterogeneous  intensity  regions,  the  scalar 
clt  c2  are  replaced  with  smooth  functions  cx  (x)  and  c2(x)  respectively  and  the  smoothness  is 
enforced  by  constraining  the  total  variations  of  these  functions. 


2.2.  Dictionary  Learning 


Sparse  coding  can  be  efficiently  utilized  by  representing  an  image  (or  a  signal)  as  a  linear 
combination  of  some  basis  vectors.  This  can  be  written  as  F  =  DY,  where  F  is  an  image  (or  a  signal), 
D  is  a  matrix  in  which  columns  represent  the  basis  vectors,  which  we  call  dictionary,  and  Y  contains 
the  representative  sparse  codes.  Given  a  set  of  training  data,  the  goal  of  dictionary  learning  [10]  [11] 
is  to  compute  a  set  of  basis  vectors,  also  called  atoms,  such  that  each  training  data  can  be  represented 
as  a  linear  combination  of  only  a  few  of  these  atoms.  The  key  idea  is  to  utilize  the  underlying  sparsity 
of  the  training  data,  while  minimizing  the  reconstruction  error.  If  F  =  [f1, ... ,  fw]  denotes  the  set  of  N 
training  images,  we  can  use  dictionary  learning  technique  to  compute  the  dictionary 
Dk  =  [d1, ... ,  dk]T  by  solving  the  following  optimization  problem 

i  i  2 


Dk  =  argmin  |fi  -  DTyil 


D,yi 


such  that  1 1  y  i  1 1 Q  <  0,  Vi  =  l,...,N.  (2) 

where  yL  is  a  coefficient  vector  corresponding  to  the  ith  training  image  and  6  is  a  scalar  which 
dictates  the  level  of  sparsity.  There  are  a  number  of  methods  in  the  literature  that  use  some 
approximation  to  solve  the  hard  optimization  problem.  Dictionary  learning  exploits  sparsity  in  the 
data  (2)  by  constraining  l  0  norm  of  the  coefficients. 


2.3.  Multiple  Kernel  Learning 

To  deal  with  non-linearity  in  the  data,  the  kernel  approach  is  applied  which  non-linearly 
transforms  a  data  to  a  higher  dimensional  space.  As  proposed  in  [12]  [13]  [14]  [15],  linear 
learning  methods  can  be  generalized  to  non-linear  learning  methods  by  this  non-linear 
transformation.  Let  <fi:RN  ->  OF,  be  the  non-linear  transformation  that  transforms  the  features  to 
a  higher  dimensional  kernel  space  referred  to  as  reproducing  kernel  Hilbert  space  (RKHS).  In 

this  space,  the  data  Y  =  \y1,y2, . ,ywJ  can  now  be  represented  as 

0(7)  =  [0(yi),0(y2), . ,0(y;vc)].  A  kernel  function  K(.),  yields  the  inner  product  of  any 

given  two  vectors  in  RKHS  such  that  Kiy^y^  =  (p(ydT 4>(yt).  The  kernel  function  K(.)  gives 
a  measure  of  the  similarity  in  a  reproducing  kernel  Hilbert  space  and  in  RKHS  can  be 
represented  as  a  linear  combination  of  other  valid  kernels  where  the  linear  coefficients  are  non¬ 
negative  such  that: 

£ 

X  =  ^  |3SKS  (3) 

S=  1 

where  X  is  the  kernel  function  obtained  from  a  linear  combination  of  different  kernel  functions 
(for  £ different  features).  We  define  the  linear  combination  of  the  kernel  matrix 

%  =  2f=i  PSKS(.  )>  such  that  T'C  )tTj(.  )  =  X.  T*  is  defined  as  the  non-linear  transform  from 
feature  space  to  RKHS. 
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2.4.  Mutual  Information: 


Mutual  information  between  two  random  variables  provides  a  measure  of  dependence  on  one 
another.  Higher  the  mutual  information  greater  is  the  dependency.  A  relevance  measure  between 
features  and  the  class  they  belong  to  can  be  obtained  by  maximizing  the  mutual  information  [16] 
[17]  [18]  [19].  For  a  given  feature  y  the  mutual  information  between  the  feature  and  its  class 
T(y)  =  i  is  given  by  (15). 

KyJ(y)  =  i)  =  H(Q-H(i\y)  (4) 

where  H  (t)  is  the  entropy  given  by: 

u(y)  =  v(y)  l°g(^))  (5) 

For  a  given  class,  the  class  entropy  H(c) is  constant.  Thus  maximizing  the  mutual  information 
between  a  feature  and  a  class  would  mean  minimizing  the  conditional  entropy  H(c\x  ),  where 
the  conditional  entropy  is  given  by 

*(cw=^iogy|y  (6) 

In  the  above  equation,  p(jx)  =  Yfb=i  p(x  \c  )p(c). 

The  mutual  information  provides  a  measure  of  a  feature  belonging  to  a  class.  Thus  minimizing 
the  conditional  entropy  ensures  a  more  compact  distribution  of  the  data  belonging  to  the  same 
class  or  in  other  words  increases  the  discrimination  between  class  distributions  [20].  If  /  denotes 
the  function  transforming  the  given  feature  vector  to  the  RKHS,  we  can  re-write  the  entropy  for 
the  transformed  feature  as  H(c\f(x)). 

2.5.  Image  features 

Scale-invariant  feature  transform  (SIFT):  The  SIFT  algorithm  looks  for  important  pixels  to 
summarize  the  entire  image.  Once  the  algorithm  determines  the  important  pixels  (i.e.,  the 
descriptors),  then  it  computes  local  gradient  histograms  around  each  descriptor  to  form  a  128 
dimensional  features  for  each  descriptor.  More  details  can  be  found  in  [21]. 

Histograms  of  oriented  gradients  (HOG):  The  HOG  algorithm  forms  small  overlapping  image 
patches  from  a  given  image  and  then  for  each  patch,  it  computes  local  gradient  based  histograms. 
More  details  on  HOG  can  be  found  in  [22], 

2.6.  Learning  automata 

Inside  a  learning  automaton,  a  probability  distribution  function  (PDF)  is  defined  over  a 
parameter  search  range,  representing  the  desirability  of  each  specific  value  and  is  updated  as 
learning  proceeds  based  on  reinforcement  signal.  A  parameter  estimate  (an  automaton  output)  is 
randomly  generated  for  the  next  stage  based  on  the  current  probability  distribution. 

A  typical  work  flow  of  a  learning  automaton  is  shown  in  Figure  1:  Basic  learning  procedure  of 
learning  automata,  and  is  described  as  follows: 

1.  At  time  n,  the  automaton  generates  an  action  a(n)  from  the  current  probability 
distribution  Picci),  i  G  (1,2, ... ,  r],  over  the  action  set,  i.e.,  a(n )  G  {alt  a2, ... ,  ar},  where 
P(a;)  denotes  the  probability  that  at  is  picked  as  the  current  action  to  be  applied  to  the 
environment.  Note  that  E[=1  P(cti)  =  1. 

2.  Action  set  is  a(n )  applied  to  the  environment. 

3.  The  environment  reacts  to  the  action  a(n),  and  its  performance  is  evaluated  to  produce  a 
response  /?(n)  G  {/?lf/?2  .../?m}. 
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4.  Based  on  this  response,  Piat )  is  updated  for  i  G  {1,2, ... ,  r}. 

5.  Steps  1—4  are  repeated  until  the  process  converges  or  some  stopping  criterion  is  satisfied 
Different  probability  updating  procedures  distinguish  a  variety  of  learning  automata  algorithms. 
Their  performance  and  convergence  properties  have  been  documented  in  [23]. 


Continuous  action  reinforcement  learning 

automata  (CARLA):  This  learning 
automaton  proposed  in  [1]  extends  the 
discrete  structure  of  a  typical  learning 
automaton  to  the  case  when  the  action  set, 
as  well  as  the  response  from  the 
environment,  is  characterized  by  a 


Figure  1:  Basic  learning  procedure  of  learning  automata  continuous  variable.  Accordingly,  the 

discrete  probability  distributions  of  the 
action  set  are  replaced  by  PDFs  over  a  domain,  on  which  a  potential  action  is  defined.  It  is  used 
as  a  function  optimizer,  and  fits  in  the  image  segmentation  problem  under  consideration  well. 
The  action  by  the  automaton  corresponds  to  a  parameter  used  by  the  specific  segmentation 
algorithm,  i.e.,  the  Gaussian  parameters.  For  each  segmentation  parameter,  there  is  a 
corresponding  automaton.  The  environment  is  the  input  image  histogram  to  be  processed  by  the 
algorithm. 

The  response  is  a  certain  criterion  (e.g.,  root  mean  square  error  between  the  image  histogram 
and  its  model  as  a  fitted  Gaussian  mixture)  on  the  segmentation  performance  with  the  current 
segmentation  parameters  generated  from  the  PDFs  of  the  automata.  CARLA  is  a  linear 
reward/inaction  learning  scheme  [23].  The  basic  steps  of  CARLA  is  as  follows: 

1.  With  a  current  response  /?(n)  at  the  nth  iteration,  the  PDF  p(x;  n  +  1)  of  the  parameter 
to  be  estimated  is  updated  by  adding  a  Gaussian  G(x;a(n))  centered  at  a(n)  with  a 
certain  height  A  and  standard  deviation  a,  i.e., 


G(x;  a(n))  =  A  exp 


(x  —  a(ri)y 


2  a2 


X  ^  >  Xmax} 


(7) 


where  ( xmin ,  xmax)  designates  the  search  range  of  the  parameter  under  consideration,  a 
and  A  are  tuned  with  two  free  parameters  gw  and  gh  based  on  this  range  according  to 

&  =  dw&max  Xmin)>  and  A  =  ~  ~  (8) 

xmax  xmin 


2.  The  PDF  is  then  updated  as  follows  for  x  E  {xmin ,  xmax) 


p(x;n  +  1)  =  cs(n  +  i)[p(x,n)  +  f3(ri)G{x,  a(n))]  (9) 


Where,  cs(n  +  1)  is  a  scaling  constant  to  ensure  p(x,n  +  1)  is  a  PDF. 

3.  Suppose  a  cost  or  performance  index  of  J(n)  is  returned.  Current  and  previous  costs  are 
stored  in  a  reference  set  R(n).  The  median  and  minimum  values  Jmed  and  Jmin  are 
calculated,  and  /?  (n)  is  defined  as 

Pin)  =  max  jo;  Jmed  ~/(n)|  (10) 

Jmed  Jmin ) 

4.  The  parameter  estimate  a(n  +  1)  at  the  next  time  instant  is  randomly  picked  according  to 
the  updated  PDF  p(x;n  +  1).  In  particular,  a  random  number  r(n  +  1)  within  [0,1]  is 
generated  from  a  uniform  distribution,  and  a(n  +  1)  is  picked  such  that 
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ra(n+ 1) 

I  -p{x,  n  +  1)  dx  =  r(n  +  1) 

xmin 

5.  The  learning  procedure  continues  until  a  prescribed  criterion  or  a  stopping  condition  is 
met.  The  parameter  estimate  is  then  taken  as  the  value  corresponding  to  the  maximum  of 
the  PDF.  With  the  Gaussian  parameters  obtained  from  the  learning  process,  thresholds 
can  be  computed  for  a  minimum  misclassification  error  or  simply  as  the  average  of  two 
adjacent  Gaussian  mean  estimates. 

3.  Methods  and  experimental  results 

3.1.  Image  Segmentation 

3.1.1.  Learning  automata  for  image  segmentation 

We  consider  the  multi-level  thresholding  problem  for  an  image  [24]  that  has  a  normalized 
histogram  that  is  modeled  as  a  mixture  of  Gaussians.  Each  Gaussian  component  in  this  case  is 
assumed  to  correspond  to  an  object  (or  a  group  of  objects  with  similar  pixel  grayscale  level 
distributions)  of  interest.  Two  key  issues  to  be  addressed  in  this  case: 

a.  the  number  of  components  in  the  GMM  needs  to  be  determined, 

b.  parameter  estimates  of  each  Gaussian  component  are  required  for  threshold 
computations. 

We  adopt  the  method  used  in  [25],  where  a  Gaussian  component  is  detected  with  a  pair  of  zero 
crossings  in  the  second  order  difference  of  the  GMM.  In  an  ideal  Gaussian  density  distribution 
with  a  mean  p  and  a  standard  deviation  a,  this  pair  of  zero  crossings  occurs  at  the  inflection 
points  of  the  Gaussian,  i.e.,  p  ±a.  Since  the  tails  of  the  Gaussian  components  overlap  with  each 
other  to  a  large  extent  in  general,  the  locations  of  zero  crossings  cannot  be  used  to  precisely 
extract  the  individual  component.  We  adopt  an  iterative  learning  procedure,  in  which  locations  of 
the  zero  crossings  are  used  to  initialize  the  search  range  for  a  given  parameter,  and  a  learning 
automaton  [26]  is  employed  to  progressively  refine  the  estimate.  The  Gaussian  component 
detection  and  localization  procedure  outlined  shown  in  Fig.Figure  2:  Gaussian  component 
detection  and  parameter  estimation  with  learning  automataDetailed  description  is  provided  as 
follows. 

a.  Gaussian  Component  Detection 

The  histogram  of  an  image  provides  statistical  information  on  its  pixels  with  different 
grayscale  levels.  It  is  first  normalized  to  be  viewed  as  a  probability  density  function  h(x),  x  G 
[0;  L  —  1],  where  L  is  the  number  of  different  grayscale  levels  used  in  the  image  representation. 
The  normalized  histogram  h(x)  is  then  modeled  as  a  Gaussian  mixture  G(x).  To  avoid  over 
segmentation  histogram  for  a  real  image  is  first  smoothed  by  a  Gaussian  kernel  with  a  chosen 
standard  deviation  r. 

c'w  =  vfeexp(-^)  (11) 

This  gives  a  smoothed  normalized  histogram  /iT(x)  =  GT (x)  *  h(x),  denotes  the 
convolution  operation. 

Zero  crossing  of  the  derivative  approximation  signal  for  /iT(x)  [27]  can  then  be  detected. 
Note  that  the  zero  crossings  should  appear  in  pairs  with  the  sign  of  the  second  order  difference  of 
/iT(x)  changing  from  positive  to  negative  at  the  left  crossing  point,  and  from  negative  to  positive 
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at  the  right  crossing  point.  These  inflection  points  occur  at  p  ±  a  for  an  ideal  Gaussian  G(p,  a), 
and  provide  initial  estimates  of  its  mean  and  standard  deviation.  Each  pair  of  inflection  points 
determines  the  presence  of  a  Gaussian  component.  Suppose  there  are  K  pairs  of  inflection  points. 
The  Gaussian  mixture  model  of  hT(x)  is  expressed  as: 


where,  pit  and  p;  are  the  mean,  the  standard  deviation  and  the  weight  of  the  ith  component  in 
the  mixture  respectively  and  £f=  1  pt  =  1. 

b.  Gaussian  Parameter  Estimation  with  Learning  Automata 

A  learning  automaton  can  be 
considered  as  an  optimizer  that 
attempts  to  determine  an  optimal 
action  (out  of  a  set  of  potential  actions 
a1,a2,  ...,ar)  to  be  applied  to  the 
external  environment  for  the  best 
performance  (maximum  benefit, 
minimum  cost,  etc.),  based  on  a 
progressive  learning  process  from  the 
response  (/?i,/?2>  —  >Pm)  °f  the 
environment,  in  a  stochastic  way  [23] 
as  described  in  section  2.7.  The 
desirability  of  the  action  set  is 
represented  by  probabilities,  which 
are  updated  as  learning  proceeds. 


Experimental  Results  Figure  2:  Gaussian  component  detection  and  parameter  estimation  with 

.  ,  ,i  ,  ,  •  learning  automata 

An  example  of  the  segmentation 

problem  of  a  remotely  sensed  image  Figure  3:  (a)  shows  an  example  of  a  remote-sensing  image  for 
segmentation  and  (b)  shows  the  Gaussian  component  detection  is  provided  next  to  illustrate  the 
descriptions  and  basic  operations  of  the  proposed  thresholding  method.  Here  we  are  interested  in 
automatically  computing  suitable  thresholds  that  can  divide  the  image  into  three  regions:  man¬ 
made  structures,  the  green  area,  and  the  water  area.  In  this  case,  the  image  is  assumed  to  contain 
three  components  of  different  grayscale  level  distributions,  and  two  thresholds  are  sought.  The 
Gaussian  component  detection  procedure  determines  three  pairs  of  zero  crossings,  i.e.,  K=3,  as 
shown  in  Figure  3:  (a)  shows  an  example  of  a  remote-sensing  image  for  segmentation  and  (b)  shows  the 
Gaussian  component  detection.  Therefore,  there  are  eight  Gaussian  parameters  to  be  estimated. 
Figure  4:  (a),  (b)  and  (c)  shows  the  three  different  Gaussian  components  and  the  segmentation  results 
corresponding  to  that  Gaussian  component  shows  the  parameter  learning  process  (evolution  of  the 
PDF)  and  the  final  segmentation  results  for  each  Gaussian  component.  For  this  example,  the  two 
threshold  levels  are  simply  chosen  based  on  the  Gaussian  mean  estimates  with 


Ti  = 


Pi  +  ifo+i 


(13) 


where  ftL  is  the  estimate  of  /ij  ,  the  mean  of  the  ith  Gaussian  component.  The  thresholds  7)  can 
also  be  computed  based  on  other  criteria,  e.g.  minimum  misclassification  error  [28],  using  the 
estimates  of  the  standard  deviations  of  each  Gaussian  component. 
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Smogtfced  hetgram  with  2nd  derivative  zero  crossings  marked 
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2nd  derivative  of  smoothed  histgram 


(b) 


Figure  3:  (a)  shows  an  example  of  a  remote-sensing  image  for  segmentation  and  (b)  shows  the  Gaussian  component 
detection 


3.1.2.  Image  Segmentation  using  dictionary  learning 

Here  we  develop  an  automated  segmentation  algorithm,  where  a  set  of  basis  vectors  is  obtained 
from  an  existing  dataset  of  images.  .  The  main  objective  is  that  instead  of  explicitly  specifying 
the  set  of  basis  elements;  we  estimate  an  optimal  set  of  bases  from  the  set  of  training  images 
using  dictionary  learning.  Similar  images  to  be  segmented  are  expressed  as  a  linear  combination 
of  these  basis  vectors.  As  mentioned  earlier,  the  constant  intensity  model  fails  to  capture  the 
intensity  heterogeneity.  Hence  the  basic  Chan-Vese  model  [2],  is  generalized  to  make  the  model 
more  flexible.  In  order  to  do  so,  we  add  higher  order  terms  which  can  capture  the  intensity 
variations  in  the  regions.  Going  by  the  intuition  of  Chan  and  Vese,  it  is  fair  to  approximate  the 
mean  image  of  a  dataset  as  a  piecewise  constant  image.  The  two  steps  that  comprises  of  the 
segmentation  method  are  as  follows: 

a.  Dictionary  learning 

The  first  step  involves  obtaining  the  basis  elements  for  a  given  dataset  which  gives  a  compact 
representation  of  all  the  images  corresponding  to  that  dataset.  We  employ  K-SVD  based  [21]  the 
dictionary  learning  discusses  in  section  2.2.  We  denote  Dk(x)  =  [di(x), ... ,  dk(x)]T  as  the 
dictionary  learned  from  the  mean  subtracted  images  using  (2). 

b.  Level  set  segmentation  using  learned  dictionary 

Assuming  a  mean  image  which  is  approximately  piecewise  constant,  the  dictionary  atoms 
learned  from  the  mean  subtracted  dataset  can  be  utilized  to  provide  the  non-linear  variation 
necessary  to  model  the  intensity  inhomogeneity  [29].  A  generalized  version  of  Chan-Vese's 
model  can  be  formulated  as  follows: 


r 

k 

2 

r 

k 

£(0,  A,  B)  = 

/(X)-V  M;(X ) 

(x)  dx  + 

fOO-'Y  bidi(x) 

'a 

i= 0 

i= 0 

2 


+  v  jjVHe(<P)\dx  +  \(\\A\\l  +  \\B\\22) 


m2(x)dx 


(14) 
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(a)  Gaussian  component  #1:  Water  area 


(c)  Gaussian  component  #3:  Manmade  structures 


Figure  4:  (a),  (b)  and  (c)  shows  the  three  different  Gaussian  components  and  the  segmentation  results  corresponding 
to  that  Gaussian  component 
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Here  d0(x)  =  1  and  dll ... ,  dk  are  dictionary  elements  or  atoms  which  are  used  to  model  the 
non-linearity  in  the  intra-region  intensities  of  the  images.  The  third  term  in  (14)  introduces 
smoothness  in  the  solution,  which  is  controlled  using  the  parameter  v.  A  =  [a0, ... ,  ak]T ,  B  = 
[b0, ... ,  bkY  are  (k  +  1)  dimension  real  valued  coefficient  vectors.  The  parameter  A  reduces 
over-fitting,  by  constraining  the  l2  norm  of  the  coefficient  vectors. 

In  other  words,  (14)  generalizes  the  traditional  Chan-Vese  technique  by  introducing 
capability  to  handle  heterogeneous  image  regions.  Here  dlt ... ,  dk  can  be  interpreted  as  'detail 
functions'  to  model  the  intensity  variation  in  conjunction  to  the  constant  illumination  term  d0. 
Assuming  a  mean  image  which  is  approximately  piecewise  constant,  the  dictionary  atoms  are 
learned  from  the  mean  subtracted  dataset  and  can  be  utilized  to  provide  the  non-linear  variation 
necessary  to  model  the  intensity  inhomogeneity.  One  can  also  think  of  the  dictionary  atoms  as 
incorporating  higher  order  details,  learned  to  suit  the  dataset 


Experimental  results 

We  use  five  different  sets  of  images  to  evaluate  the  performance  of  our  algorithm.  Out  of  them, 
three  datasets  contain  images  of  medical  phantoms,  which  mimic  human  veins.  The  remaining 
two  datasets  consists  of  human  vein  images,  captured  in  vivo.  Each  dataset  contains 
approximately  18  to  60  images,  captured  in  C-mode  using  a  portable,  battery  operated  ultrasound 
scanner.  The  different  images  in  a  given  set  correspond  to  the  image  of  a  vein  at  various  depths. 
Note  that  each  dataset  consists  of  registered  blood  vessel  images.  The  vessel  orientation  and 
scale  are  also  consistent.  A  separate  dictionary  is  computed  using  the  mean-subtracted  images  for 
each  of  the  datasets. 

Dependency  on  contour  initialization:  We  show  the  performance  of  our  algorithm  using  both 
manual  and  automatic  initialization  methods.  The  segmentation  results  with  manual  and 

automatic  initialization  for  DL2S  [29] 
Chan-Vese  [9]  and  L2S  [30]  are  shown  in 
Figure  5:  Comparison  of  segmentation  results 
using  manual  and  automatic  initialization 
methods,  (a)  initialized  contour  (b)  segmentation 
results  of  Chan-Vese  (white),  (c)  segmentation  via 
L2S  (black)  and  (d)  segmentation  via  DL2S  model 
(yellow)  for  the  same  image.  We  observe 
that  the  segmentation  performance  of  L2S 
drops  significantly  for  automatic 
initialization,  which  is  also  true  for  Chan- 
Vese  method.  In  comparison  DL2S  has 
similar  segmentation  results  for  both 
initialization  techniques.  Quantitative  evaluation  of  performance  based  on  initialization  is 
provided  in  Table  1. 


(a) 


(b) 


Figure  5:  Comparison  of  segmentation  results  using  manual 
and  automatic  initialization  methods,  (a)  initialized  contour  (b) 
segmentation  results  of  Chan-Vese  (white),  (c)  segmentation 
via  L2S  (black)  and  (d)  segmentation  via  DL2S  model  (yellow) 


Table  1:  Quantitative  comparison  of  the  three  methods 


DL2S  [291 

Chan-Vese  [91 

L2S  [301 

Manual 

Auto 

Manual 

Auto 

Manual 

Auto 

0.93  +  0.02 

0.92  +  0.04 

0.91  ±  0.07 

0.86  +  0.11 

0.89  ±  0.09 

0.55  ±  0.17 

0.90  +  0.04 

0.90  +  0.07 

0.88  +  0.12 

0.88  +  0.12 

0.90  +  0.06 

0.88  +  0.12 

0.85  +  0.08 

0.86  +  0.08 

0.80  ±  0.08 

0.85  ±  0.11 

0.85  ±  0.12 

0.84  +  0.09 

0.80  +  0.10 

0.83  ±  0.06 

0.69  +  0.21 

0.73  ±  0.12 

0.70  ±  0.19 

0.60  +  0.14 
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0.76  +  0.16 

0.76  +  0.10 

0.75  ±  0.14 

0.72  ±  0.11 

0.72  ±  0.16 

0.62  ±  0.13 

Quantitative  comparison 
of  segmentation:  Figure  6  shows  the  segmentation  performance  of  Chan-Vese  (white)  [7]),  L2S 
(black)  [18]  and  DL2S  (yellow).  Fig.  4  shows  that  DL2S  is  able  to  capture  the  blood  vessels 
more  appropriately  in  presence  of  severe  contrast  and  intensity  inhomogeneity.  A  quantitative 
comparison  for  five  datasets  is  shown  in  Error!  Reference  source  not  found..  The  Dice 

index,  T>  =  is  evaluated  for  the  three  algorithms.  Here  sg  denotes  the  ground  truth 

segmentation  and  st  is  the  segmentation  result  for  DL2S,  Chan-Vese  or  L2S.  It  is  observed  that 
for  each  dataset,  DL2S  demonstrates  significantly  better  performance  than  L2S  or  CV.  D12S 
achieves  highest  improvement  in  performance  of  above  65%  in  one  dataset  and  42%  in  an  in- 
vivo  dataset.  On  average,  we  observe  increase  in  segmentation  accuracy  by  more  than  12%  for 
the  all  the  datasets. 


Figure  6:  Segmentation  comparison  of  DL2S  [29]  with  Chan-Vese  [9]  and  L2S  [30]  is  shown  here.  The  original  C- 
mode  ultrasound  images  captured  with  a  portable  scanner  are  shown  in  the  first  row.  Segmentation  results  of  Chan- 
Vese  (white),  L2S  (black)  and  DL2S  model  (yellow)  on  these  images  are  shown  in  rows  2,  3  and  4  respectively.  The 
first  four  images  are  phantom  and  the  last  four  are  in  vivo  images  of  blood  vessels 


3.2.  Image  Classification 

Here  we  develop  an  information  theoretic  kernel  combination  method  embedded  in  the 
dictionary  learning  framework.  One  advantage  of  the  kernel  space  representation,  other  than  a 
higher-dimensional  representation,  lies  in  the  fact  that  different  features  can  be  combined  in  the 
kernel  space  using  multiple  kernel  functions  [31].  The  kernel-sparse  representation  techniques 
[15],  [13]  mainly  deal  with  a  single  kernel  in  sparse  representation  or  dictionary  learning 
framework.  Our  method  learns  a  dictionary  in  the  kernel  space  for  each  of  the  classes  in  the 
learning  phase.  We  employ  a  mutual  information  based  approach  to  obtain  the  most  desirable 
weights  for  kernel  combination.  In  the  testing  phase,  our  method  exploits  the  learned  dictionaries 
and  the  kernel  weights  to  assign  a  class  label  to  the  test.  The  steps  involved  in  the  classification 
system  are  shown  in  the  block  diagram. 
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A.  Training  using  multi-kernel  dictionary  learning 


The  training  phase  involves  three  steps:  The  first  step  involves  discriminative  feature  and 
respective  kernel  matrix  computation.  The  next  step  in  training  involves  mutual  information 
based  combination  of  these  features.  Finally,  the  third  step  is  a  classifier  design  using  a 
dictionary  learning  framework. 

a.  Discriminative  feature  and  respective  kernel  matrix  computation 

b.  In  image  classification,  a  crucial  part  involves  selecting  the  most  discriminative  feature  that 
can  differentiate  between  images  belonging  to  different  classes  in  a  dataset.  Here  we  address 
the  feature  combination  problem  by  multi-kernel  dictionary  learning.  Different  features  (as 
explained  in  2.6)  are  computed  for  each  of  the  images  and  the  kernel  matrix  corresponding  to 
that  feature  are  computed.  The  different  kernels  used  in  the  experiment  are  Gaussian 

(K(yi,yj)  =  exp  (— y||yi  -  yj||2)  and  polynomial  (K(yi,yj)  =  yfy),  where  yj  andyj  are  the 

image  featuures  and  y  is  a  scaling  parameter. 

c.  Mutual  information  based  combination  of  features 

d.  This  step  in  the  training  step  involves  learning  the  kernel  weights  £c  =  [px  ... .  p5],  for  each 

of  the  classes  c  =  1 . C.  We  define  the  linear  combination  of  the  kernel  matrix  %  = 

2f=i  PsKsC),  such  that  T'C  )TlF(. )  =  %  as  explained  in  2.3.  If  f  denotes  the  function 
transforming  the  given  feature  vector  to  the  RKHS,  we  can  re-write  the  entropy  for  the 
transformed  feature  as  H(c|f(x)).  The  weights  for  the  kernel  combination  are  hence  obtained 
by  minimizing  the  conditional  entropy  for  a  class  given  by  the  following  equation. 

argmin  H(c\X  =  f(Y,  3  )) 

»  (15) 

s.t  Ef=i  (Bs  =  1  and  Ps  >  0  Vs 

1  C 

We  initialize  ps  =  -  Vs  G  [1 ...  <5],  such  that  2s=i  Ps  =  1  -To  solve  for  Bc,  we  use  a  random 

o 

search  method  [32],  We  randomly  select  weight  values  from  a  Gaussian  distribution  such 
that,  Ps~J\f(Ps-1, 1)  and  normalized  by  £s=i  Ps .  Then  select  the  psvalues  for  which 
H(c|X  =  f(Y,B  ))  is  minimum,  t  denotes  the  iteration  step. 

e.  Multi-kernel  dictionary  learning:  This  step  involves  Updating  dictionary  D  (as  discussed  in 
section  2.2)  and  sparse  codes  X  for  each  of  the  classes  separately.  With  Bc  is  fixed  for,  Dc 
is  initialized  by  randomly  selecting  Kcolumns  of  Yc.  First  keeping  Dc  fixed,  we  update  the 

sparse  codes  Xc  solving  the  following.  Here  Y  =  [yi.y2 . j-»J.  ( n  e  «”*"')■  The 

sparse  codes  for  a  class  can  be  embedded  in  the  matrix  X  =  [xlt x2, . ,  xNc],  Xc  G  RKxNc. 

Let  us  denote  Nc  as  the  number  of  images  in  the  cth  class. 

argmin ||lF(y,c)  -  ©£-1Xct||  j 

4  (16) 
s-t  IMo  <TVi  G  1 ... . Nc 

We  use  orthogonal  matching  pursuit  [33]  to  solve  (10).  Once  the  sparse  codes  are  obtained 
the  next  step  is  to  update  the  dictionary.  Keeping  the  sparse  codes  fixed,  we  solve  the 
following  equation,  with  the  constraint  that  the  columns  of  the  dictionary  will  be 
orthonormal. 
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(17) 


argmin||¥(yc)  -  V^W} 

©C 

The  objective  function  can  re-written  as  follows  and  the  optimized  over  ©£  [14]. 

X(yi.yi)  -  2 xjDTX(Y,yi)  +  x]DTX(Y,  Y)DXi  (18) 

We  use  the  K-SVD  algorithms  [10]  for  the  dictionary  update. 


B.  Testing  phase 

The  testing  part  involves  extracting  similar  types  of  features  as  used  for  training  part  and 
using  the  learned  dictionary  to  identify  the  class  for  the  test  image 


Figure  7:  Block  diagram  of  the  classification  method  showing  the  stages  of  training  and  testing  phase 


The  classification  of  the  test  image  is  performed  based  on  the  minimum  reconstruction  error 
with  respect  to  the  class  dictionaries.  Once  the  feature  vectors  for  the  query  image,  yq  is 

available,  Vc  =  1 . C,  the  kernel  combination  for  the  test  image  is  obtained  as  Kc  = 

f(yq,3c  ),  such  that  Kc  =  T'fT'-..  The  respective  sparse  codes  xcq  corresponding  to  the  class 
dictionary  is  obtained  by  solving  the  following 

argmin  || %(?,)  -  Vcxcq ||*  s.t.  ||*£||o  <  T  (19) 

xq 

The  test  image  is  identified  to  belong  to  the  particular  class  for  which  the  reconstruction  error 
is  minimum  given  by  the  following,  (i{yq)  =  c  )  =  minc  ||TJc(y(?)  —  Vcxq  ||2 
The  details  of  the  algorithm  are  provided  in  Table  2. 
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Figure  8:  Sample  images  from  Caltech  101  dataset 


Experimental  Results 

We  performed  experiments  on  Caltech  101 
dataset  [34],  The  dataset  has  101  categories 
with  about  9000  images.  About  3000  images 
were  used  for  training.  30  images  were  chosen 
at  random  per  class  to  perform  the  training. 

The  rest,  about  6000  images  were  used  for 
testing.  Sample  images  from  the  dataset  are 
shown  in  Figure  8.  We  performed  the 
experiment  with  spatial  pyramid  [35] 
representation  of  scale  invariant  feature 
transform  (SIFT)  descriptors  [36],  with 
Gaussian  (and  polynomial  and  histogram  of 
oriented  gradients  (HOG)  [22]  descriptors 
with  polynomial  kernel.  In  Figure  10,  we  plot 
the  classification  accuracy 

ttno.of  images  classified  correctly\  ^  ^  classes 
Uno.o f  test  images  ) 

in  ascending  order  of  their  classification 
accuracy.  Figure  9  shows  comparison  of 
classification  accuracy  of  MI-MKDL  with 
meta-algorithm  for  feature  nomination  [3]  of 
24  sample  classes.  MI-MKDL  shows  on 
average  an  increase  of  10%  in  classification 
accuracy  with  respect  to  the  meta-algorithm 

4.  Conclusion  and  Future 
direction 

Here  we  have  proposed  two  segmentation 
methods,  a  learning  based  threshold  selection 
method  and  a  learning  based  intensity  modeling  method.  Moreover  we  have  also  proposed 
mutual  information  based  adaptive  feature  combination  method  for  kernel  dictionary  learning. 

In  the  learning-based  thresholding  method  for  grayscale  image  segmentation,  the  image 
histogram  is  first  smoothed  and  normalized  by  a  Gaussian  kernel  and  a  best  fit  by  a  Gaussian 
mixture  to  the  smoothed  histogram  is  then  sought  through  a  learning  process.  In  particular,  the 
parameters  (mean,  standard  deviation,  and  weight)  associated  with  each 

component  in  the  Gaussian  mixture  are  estimated  with  a  set  of  learning  automata.  Thresholds  are 
computed  based  on  these  parameter  estimates. 

For  future  work,  we  propose  to  use  a  multivariate  Gaussian  mixture  as  a  model  of  the  high¬ 
dimensional  color  feature  distribution  (possibly  combined  with  texture  features  and  spatial 


Table  2:  Mutual  Information  based  multi-kernel  dictionary 
learning  algorithm 


Training  Step: 

Input:  Training  data,  Y  =  [Ylf  Y2, . ,YC], 

3  =  [ft1  Initialize  fis  =  j  Vs  E  [1  ...  S]. 

Output:  For  each  class  c,  3C)XC,VC 

Method:  For  class  c  =  1 . C, 

Set  t  =  1 

(i)  Compute  kernel  matrix  X  =  2f=i  Ps^O^  Yc) 

(ii)  Solve  for  3 £  ,  that  minimizes  conditional  entropy. 

argmin  H(c\f(Y,  Sf)) 

P 

s.t.  Tjs=i  Ps  =  1  and  ps  >  0  Vs 

(iii)  Multi-kernel  dictionary  learning 

a.  Sparse  coding  stage,  given  Dct_1, 

argmin ||T(FC)  -  D*"1** \\2P 

4 

s.t  llXjllo  <TVi  el  ,...NC 

b.  Dictionary  update  step,  given  V), 

argmin imrc)-  T>tcXtc\\j 

4 

(iv)  Set  t  =  t  +  1  until  convergence  is  reached 

Testing  Step: 

Input:  Test  data  yq ,  3C,  Vc  Vc  =  1 ...  C 
Output:  Class  label  f(yq) 

Method: 

(i)  Solve  for  sparse  codes  for  the  test  image: 

argmin  ||lFc(y(;)  -T>cxcq ||^  s.t.  ||^||o  <T 

Xq 

(ii)  The  class  label  is  assigned  as 

G’W  =  c  )  =  min  ||Tc(y(?)  -T>cxcqt 

C  ^ 
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information)  for  image  retrieval.  Unlike  existing  image  retrieval  schemes  using  modified  EM 
algorithms,  we  will  extend  the  proposed  learning  automaton  approach  to  cope  with  multivariate 
Gaussian  parameter  estimation  for  color  image  segmentation.  Detected  image  components  can 
then  be  indexed  and  matched  to  a  query  image  for  retrieval  purposes. 


Figure  9:  Classification  accuracy  for  70  classes  from  Caltech  101  dataset 


Figure  10:  Classification  accuracy  comparison  of  MI-MKDL  and  meta-algorithm  for  feature 
nomination 

In  the  second  approach  a  novel  segmentation  method  is  proposed  which  combines  the  idea  of 
dictionary  learning  and  region  based  segmentation  algorithm  in  presence  of  significant  clutter 
and  heterogeneous  intensity.  Furthermore,  it  has  been  shown  using  quantitative  and  qualitative 
results  that  the  proposed  method  outperforms  the  state  of  the  art  in  terms  of  contour  initialization 
and  demonstrates  accurate  segmentation  in  cluttered  images  without  the  use  of  explicit  shape 
priors.  The  proposed  method  requires  a  set  of  images  with  similar  objects  for  the  dictionary 
learning,  which  may  not  always  be  available. 

For  future  work  we  propose  to  learn  a  dictionary  from  image  patches  and  employ  them  to  model 
the  different  regions  of  the  image  for  segmentation  method. 
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In  our  third  approach,  we  present  a  method  of  feature  selection  and  combination  for  robust 
image  classification.  We  first  introduce  a  sparse  representation  based  dictionary  learning 
algorithm  using  kernel  space  feature  representation  and  then  extend  this  representation  by  way  of 
multi-kernel  learning.  The  multi-kernel  learning  allows  feature  combination  and  the  feature 
selection  is  optimized  using  mutual  information,  yielding  weights  for  kernel  combination.  In 
these  preliminary  experiments,  our  method  shows  an  average  increase  of  10%  in  classification 
accuracy  over  that  achieved  by  the  meta-algorithm  for  feature  nomination. 
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